Customer Segmentation with PCA and Clustering¶

Project Overview¶

This project focuses on the implementation of Principal Component Analysis (PCA) and clustering techniques to gain valuable insights into a grocery firm's customer database. The ultimate goal is to perform customer segmentation, a crucial practice that involves categorizing customers into distinct groups based on shared characteristics. By doing so, we aim to enhance our understanding of customer behavior and tailor our products and services to meet their diverse needs and preferences.

Dataset Overview¶

Introduction:¶

The dataset at hand represents a snapshot of a grocery store's customer data. The company is keen on segmenting its customers to better understand their behavior and tailor prices and services to different customer segments.

This project outline provides a structured and comprehensive overview of the goals, steps, and expected outcomes of the customer segmentation project using PCA and clustering techniques.

Link to dataset:¶

https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis

Dataset Description¶

There are four groups of information:

People

| Variable | Description |
| --- | --- |
| ID | Customer's unique identifier |
| Year_Birth | Customer's birth year |
| Education | Customer's education level |
| Marital_Status | Customer's marital status |
| Income | Customer's yearly household income |
| Kidhome | Number of children in customer's household |
| Teenhome | Number of teenagers in customer's household |
| Dt_Customer | Date of customer's enrollment with the company |
| Recency | Number of days since customer's last purchase |
| Complain | 1 if the customer complained in the last 2 years, 0 otherwise |

Products

| Variable | Description |
| --- | --- |
| MntWines | Amount spent on wine in the last 2 years |
| MntFruits | Amount spent on fruits in the last 2 years |
| MntMeatProducts | Amount spent on meat in the last 2 years |
| MntFishProducts | Amount spent on fish in the last 2 years |
| MntSweetProducts | Amount spent on sweets in the last 2 years |
| MntGoldProds | Amount spent on gold in the last 2 years |

Promotion

| Variable | Description |
| --- | --- |
| NumDealsPurchases | Number of purchases made with a discount |
| AcceptedCmp1 | 1 if the customer accepted the offer in the 1st campaign, 0 otherwise |
| AcceptedCmp2 | 1 if the customer accepted the offer in the 2nd campaign, 0 otherwise |
| AcceptedCmp3 | 1 if the customer accepted the offer in the 3rd campaign, 0 otherwise |
| AcceptedCmp4 | 1 if the customer accepted the offer in the 4th campaign, 0 otherwise |
| AcceptedCmp5 | 1 if the customer accepted the offer in the 5th campaign, 0 otherwise |
| Response | 1 if the customer accepted the offer in the last campaign, 0 otherwise |

Place

| Variable | Description |
| --- | --- |
| NumWebPurchases | Number of purchases made through the company's website |
| NumCatalogPurchases | Number of purchases made using a catalogue |
| NumStorePurchases | Number of purchases made directly in stores |
| NumWebVisitsMonth | Number of visits to the company's website in the last month |

Key Steps in the Project¶

  1. Data Exploration:

    • Explore the structure and characteristics of the dataset to gain initial insights.
  2. Data Preprocessing:

    • Handle missing values, outliers, and any necessary data transformations.
  3. Principal Component Analysis (PCA):

    • Implement PCA to reduce the dimensionality of the dataset while retaining essential information.
  4. Clustering:

    • Utilize clustering algorithms (e.g., K-means) to segment customers based on their purchasing behavior.
  5. Interpretation:

    • Interpret the results of the clustering and visualize customer segments for better understanding.
  6. Business Recommendations:

    • Provide actionable recommendations based on the identified customer segments to enhance marketing and service strategies.
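The steps above can be sketched end-to-end with scikit-learn. This is a minimal illustration on synthetic stand-in data, not the project's final pipeline; the feature matrix, the number of components, and the number of clusters here are placeholders:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))  # stand-in for the preprocessed customer features

# Scale -> reduce -> cluster, mirroring the project's key steps
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=3).fit_transform(X_scaled)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_reduced)

print(X_reduced.shape)  # (200, 3)
print(sorted(set(labels.tolist())))
```

Scaling before PCA matters because PCA is variance-driven; unscaled features with large ranges would dominate the components.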

Expected Outcomes¶

  • A clear segmentation of customers into distinct groups.
  • Insights into the key features driving customer segmentation.
  • Practical recommendations for tailoring products and services to different customer segments.

Table of Contents¶
  1. Import Libraries and Get to know Dataset

    • 1.1. Import necessary Libraries and Packages
    • 1.2. Get basic information about the dataset
    • 1.3. Examine unique values of each column
    • 1.4. Summary of the dataset
  2. Data Visualization and Transformation

    • 2.1. Visualize and transform categorical features
    • 2.2. Visualize and transform numerical features
    • 2.3. Drop redundant variables
    • 2.4. Correlation Analysis
  3. Preprocessing

    • 3.1. Encoding
    • 3.2. Scale the Data
  4. Dimensionality Reduction

    • 4.1. Initialize PCA object
    • 4.2. Identify the Optimal Number of Components
    • 4.3. Conclusion about n_components
  5. Customer Segmentation using K-Means Clustering

    • 5.1. Initialize model
    • 5.2. Choose k
    • 5.3. Determining the Optimal Number of Clusters: Elbow vs. Silhouette
  6. Evaluate the Clusters

  7. Final conclusion

1. Import Libraries and Get to know Dataset¶

1.1. Import necessary Libraries and Packages¶

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_columns', None)

import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

sns.set_theme(style='white')
%config InlineBackend.figure_format = 'retina'
plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.top"] = False

#import warnings
#warnings.filterwarnings("ignore")

1.2. Get basic information about the dataset¶

In [2]:
original_df = pd.read_csv('marketing_campaign.csv', delimiter='\t')
print(original_df.shape)
original_df.head()
(2240, 29)
Out[2]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
0 5524 1957 Graduation Single 58138.00 0 0 04-09-2012 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 3 11 1
1 2174 1954 Graduation Single 46344.00 1 1 08-03-2014 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 3 11 0
2 4141 1965 Graduation Together 71613.00 0 0 21-08-2013 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 3 11 0
3 6182 1984 Graduation Together 26646.00 1 0 10-02-2014 26 11 4 20 10 3 5 2 2 0 4 6 0 0 0 0 0 0 3 11 0
4 5324 1981 PhD Married 58293.00 1 0 19-01-2014 94 173 43 118 46 27 15 5 5 3 6 5 0 0 0 0 0 0 3 11 0
In [3]:
df = original_df.copy()
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Z_CostContact        2240 non-null   int64  
 27  Z_Revenue            2240 non-null   int64  
 28  Response             2240 non-null   int64  
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB

Change Year_Enrolled to datetime¶

In [5]:
df['Year_Enrolled'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')
In [6]:
missing_encountered = False

for col in df.columns:
    missing = df[col].isnull().sum()
    if missing > 0:
        print(f'{col}: {missing} missing values.')
        missing_encountered = True
        
if not missing_encountered:
    print("No missing values.")
Income: 24 missing values.

The dataset comprises 29 columns: 25 integer variables, 3 objects, and 1 float. The Income variable is the only one with missing values.

In [7]:
df['Income'] = df['Income'].fillna(df['Income'].median())

1.3. Examine unique values of each column¶

In [8]:
data = []
for col in df.columns:
    unique_values = df[col].unique()
    num_unique = len(unique_values)
    dtype = df[col].dtype
    if num_unique > 10:
        data.append((col, num_unique, "", dtype))
    else:
        unique_values_str = ", ".join(map(str, unique_values))
        data.append((col, num_unique, unique_values_str, dtype))

uniques = pd.DataFrame(data, columns=["Column", "# of Uniques", "Unique Values (<10)", "Dtype"])
uniques.sort_values(by='# of Uniques').reset_index(drop=True)
Out[8]:
Column # of Uniques Unique Values (<10) Dtype
0 Z_Revenue 1 11 int64
1 Z_CostContact 1 3 int64
2 AcceptedCmp2 2 0, 1 int64
3 Response 2 1, 0 int64
4 AcceptedCmp1 2 0, 1 int64
5 AcceptedCmp5 2 0, 1 int64
6 AcceptedCmp3 2 0, 1 int64
7 Complain 2 0, 1 int64
8 AcceptedCmp4 2 0, 1 int64
9 Teenhome 3 0, 1, 2 int64
10 Kidhome 3 0, 1, 2 int64
11 Education 5 Graduation, PhD, Master, Basic, 2n Cycle object
12 Marital_Status 8 Single, Together, Married, Divorced, Widow, Al... object
13 NumCatalogPurchases 14 int64
14 NumStorePurchases 14 int64
15 NumDealsPurchases 15 int64
16 NumWebPurchases 15 int64
17 NumWebVisitsMonth 16 int64
18 Year_Birth 59 int64
19 Recency 100 int64
20 MntFruits 158 int64
21 MntSweetProducts 177 int64
22 MntFishProducts 182 int64
23 MntGoldProds 213 int64
24 MntMeatProducts 558 int64
25 Year_Enrolled 663 datetime64[ns]
26 Dt_Customer 663 object
27 MntWines 776 int64
28 Income 1975 float64
29 ID 2240 int64

1.4. Summary of the dataset¶

  • 1 Unique Value: Variables Z_Revenue and Z_CostContact each take a single value (11 and 3, respectively), so they contribute nothing to our clustering analysis. Consequently, I will remove them from the dataset.

  • 2 Unique Values: Variables AcceptedCmp1 through AcceptedCmp5, Complain, and Response are binary, taking only the values 0 and 1. I intend to visualize them to assess the potential for combining these variables.

  • 3 Unique Values: Variables Teenhome and Kidhome can be combined into a single variable named Dependents, representing the number of dependents in a household.

  • Education: The Education variable entails five categories: 'Graduation', 'PhD', 'Master', 'Basic', and '2n Cycle'. After researching the significance of 'Basic' and '2n Cycle' education levels:

    • 'Basic' signifies foundational or primary education. This category will be retained as is.
    • '2n Cycle' represents advanced studies beyond a bachelor's degree (1st Cycle). It will be merged with the 'Master' category.
  • Marital_Status: With eight categories, namely Single, Together, Married, Divorced, Widow, Alone, Absurd, and YOLO, I propose merging them into two categories: 'Relationship' and 'Single'.

  • ID: This variable is deemed non-contributory and will be excluded from further analysis.

  • Year_Birth: To enhance interpretability, I will transform the Year_Birth variable into Age, using the latest enrollment year as the reference point.

  • Year_Enrolled: Similar to Year_Birth, I will create a new variable named Day_Enrolled, using the latest enrollment date as the reference.

  • Other variables with more than 10 unique values will remain unaltered.

2. Data Visualization and Transformation¶

In this stage, I will conduct the data cleaning mentioned above. Before that, let's plot some charts:

2.1. Visualize and transform categorical features¶

In [9]:
categorical_cols = [
    'Kidhome', 
    'Teenhome', 
    'AcceptedCmp1', 
    'AcceptedCmp2', 
    'AcceptedCmp3', 
    'AcceptedCmp4', 
    'AcceptedCmp5',
    'Complain', 
    'Response',
    'Education',
    'Marital_Status'
]

fig, axes = plt.subplots(3, 4, figsize=(16, 12))

for i, col in enumerate(categorical_cols):
    order = df[col].value_counts().index
    ax = axes[i // 4, i % 4]
    sns.countplot(data=df, x=col, ax=ax, order=order, palette='magma', hue=col, legend=False)
    ax.set_title(col)
    ax.set_xlabel('')
    ax.set_ylabel('')
    
    if col == 'Marital_Status':  
        ax.tick_params(axis='x', rotation=45)

plt.tight_layout()

2.1.1. Summary of the Countplot¶

  • The countplot analysis reveals significant imbalances in the distribution of values within categorical variables.

  • Kidhome and Teenhome

    • Combining Kidhome and Teenhome might be beneficial for a more consolidated representation of the number of children in different age groups.
  • AcceptedCmp1 to AcceptedCmp5

    • Considering combining AcceptedCmp1 to AcceptedCmp5 due to their apparent similarity, potentially simplifying the analysis and interpretation.
  • Education and Marital_Status

    • There is a need for cleaning or grouping within the Education and Marital_Status variables to address potential inconsistencies or improve categorization.
  • Complaints and Survey Response

    • Most customers did not register any complaints.
    • A relatively small proportion of the total customer base (around 300 out of 2,240) responded to the last campaign, as captured in the Response variable.

Below is a more detailed analysis of each variable:

2.1.2. Examine Complain¶

In [10]:
df['Complain'].value_counts()
Out[10]:
Complain
0    2219
1      21
Name: count, dtype: int64

The highly imbalanced distribution of the Complain variable (21 occurrences of 1 versus 2,219 occurrences of 0) is unlikely to contribute meaningful insights to customer segmentation, so I will drop this variable.

2.1.3. Examine AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4 and AcceptedCmp5¶

In [11]:
accepted = df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']]
plt.figure(figsize=(8, 6))
sns.heatmap(accepted, cbar=False)
plt.title('Accepted Campaigns');

2.1.4. Create Accepted_Campaign to sum up the five campaign flags¶

In [12]:
df['Accepted_Campaign'] = df[['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2']].sum(axis=1)
prop_accepted_campaign = df['Accepted_Campaign'].value_counts(normalize=True)

sns.countplot(x=df['Accepted_Campaign'], palette='magma', hue=df['Accepted_Campaign'], legend=False)
for i, prop in enumerate(prop_accepted_campaign):
    plt.text(i, prop * len(df), f'{prop:.2%}', ha='center', va='bottom', color='black', fontsize=12)
plt.title('Distribution of Total Accepted Campaigns');

The distribution of accepted campaigns among customers is skewed, with most customers (approximately 79%) not accepting any campaigns. Around 15% of customers accepted one campaign, while smaller proportions engaged with two, three, or four campaigns, with the latter category having negligible representation.

2.1.5. Create Dependents variable based on Kidhome and Teenhome¶

In [13]:
df['Dependents'] = df['Kidhome'] + df['Teenhome']

2.1.6. Grouping Marital_Status¶

In [14]:
group_mapping = {
    'Married': 'Relationship',
    'Together': 'Relationship',
    'Single': 'Single',
    'Divorced': 'Single',
    'Widow': 'Single',
    'Alone': 'Single',
    'Absurd': 'Single',
    'YOLO': 'Single'
}

df['Marital_Status'] = df['Marital_Status'].map(group_mapping)

2.1.7. Grouping Education into smaller groups¶

In [15]:
education_mapping = {
    'Graduation': 'Graduation',
    'PhD': 'PhD',
    'Master': 'Master',
    'Basic': 'Basic',
    '2n Cycle': 'Master'  
}

df['Education'] = df['Education'].map(education_mapping)
In [16]:
plt.figure(figsize=(12, 3.5))

categories = ['Dependents', 'Marital_Status', 'Education']
palette = 'magma'

for i, col in enumerate(categories, 1):
    plt.subplot(1, 3, i)
    order = df[col].value_counts().index
    sns.countplot(data=df, x=col, palette=palette, order=order, hue=col, legend=False)
    plt.title(f'Count of {col}')
    plt.ylabel('')

plt.tight_layout();

2.1.8. Summary of Newly-Created Categorical Data¶

  • Dependents

    • The majority of customers have 1 dependent, with over 1000 individuals falling into this category.
    • Customers with 0 dependents are the next most common, totaling around 600.
    • The count for customers with 2 dependents ranks third, comprising approximately 400 individuals.
    • The count for customers with 3 dependents is the lowest, and there are no instances of customers having more than 3 dependents.
  • Marital Status

    • The dataset shows a higher number of customers in a relationship compared to single customers, with nearly 1500 and 750 individuals, respectively.
  • Education

    • The distribution shape of Education is similar to Dependents, with the highest number of customers falling into the "Graduation" category.
    • "Master" and "PhD" categories have the second and third-highest counts, with around 600 and 500 customers, respectively.
    • The "Basic" education category has the lowest count among the education levels.

2.2. Visualize and transform numerical features¶

In [17]:
numeric_cols = [
    'Income', 'Recency', 
    'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 
    'NumDealsPurchases', 
    'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 
    'NumWebVisitsMonth'
]

fig, axes = plt.subplots(5, 3, figsize=(15, 20))

for i, col in enumerate(numeric_cols):
    ax = axes[i // 3, i % 3]
    sns.histplot(data=df, x=col, ax=axes[i // 3, i % 3])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.set_title(f'Histogram of {col}')
plt.tight_layout();

2.2.1. Summary of the Histograms¶

  • The histograms for most features, except Recency and NumWebVisitsMonth, exhibit right-skewness with a concentration of outliers at the high end.
  • The Income histogram contains outliers, hindering a clear view of the underlying distribution. Further examination of these outliers is necessary.
  • Considering the similarities among certain features, combining them into a new variable such as total cost could enhance the analysis.
  • The uniform distribution of Recency suggests limited discriminatory power for customer segmentation.
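The skewness reading from the histograms can be backed up numerically with pandas' `skew()`. A small sketch on synthetic data (the column names here are stand-ins, not the notebook's dataframe): positive values flag a long right tail, values near zero suggest symmetry.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Spending': rng.exponential(scale=300, size=1000),  # right-skewed, like the Mnt* features
    'Recency_like': rng.uniform(0, 99, size=1000),      # roughly uniform, near-zero skew
})

skews = demo.skew()
print(skews)
```

On the real data, `df[numeric_cols].skew()` would give the same kind of per-column summary.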

2.2.2. Examine outliers of Income¶

In [18]:
median_income = df['Income'].median()
df['Income'] = df['Income'].fillna(median_income)
In [19]:
df[df['Income'] > 100000]
Out[19]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response Year_Enrolled Accepted_Campaign Dependents
124 7215 1983 Graduation Single 101970.00 0 0 12-03-2013 69 722 27 102 44 72 168 0 6 8 13 2 0 1 1 1 0 0 3 11 1 2013-03-12 3 0
164 8475 1973 PhD Relationship 157243.00 0 1 01-03-2014 98 20 2 1582 1 2 1 15 0 22 0 0 0 0 0 0 0 0 3 11 0 2014-03-01 0 1
203 2798 1977 PhD Relationship 102160.00 0 0 02-11-2012 54 763 29 138 76 176 58 0 7 9 10 4 0 1 1 1 0 0 3 11 1 2012-11-02 3 0
252 10089 1974 Graduation Single 102692.00 0 0 05-04-2013 5 168 148 444 32 172 148 1 6 9 13 2 0 1 1 1 1 0 3 11 1 2013-04-05 4 0
617 1503 1976 PhD Relationship 162397.00 1 1 03-06-2013 31 85 1 16 2 1 2 0 0 0 1 1 0 0 0 0 0 0 3 11 0 2013-06-03 0 2
646 4611 1970 Graduation Relationship 105471.00 0 0 21-01-2013 36 1009 181 104 202 21 207 0 9 8 13 3 0 0 1 1 0 0 3 11 1 2013-01-21 2 0
655 5555 1975 Graduation Single 153924.00 0 0 07-02-2014 81 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 3 11 0 2014-02-07 0 0
687 1501 1982 PhD Relationship 160803.00 0 0 04-08-2012 21 55 16 1622 17 3 4 15 0 28 1 0 0 0 0 0 0 0 3 11 0 2012-08-04 0 0
1300 5336 1971 Master Relationship 157733.00 1 0 04-06-2013 37 39 1 9 2 0 8 0 1 0 1 1 0 0 0 0 0 0 3 11 0 2013-06-04 0 1
1653 4931 1977 Graduation Relationship 157146.00 0 0 29-04-2013 13 1 0 1725 2 1 1 0 0 28 0 1 0 0 0 0 0 0 3 11 0 2013-04-29 0 0
1898 4619 1945 PhD Single 113734.00 0 0 28-05-2014 9 6 2 3 1 262 3 0 27 0 0 1 0 0 0 0 0 0 3 11 0 2014-05-28 0 0
2132 11181 1949 PhD Relationship 156924.00 0 0 29-08-2013 85 2 1 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 3 11 0 2013-08-29 0 0
2233 9432 1977 Graduation Relationship 666666.00 1 0 02-06-2013 23 9 14 18 8 1 12 4 3 1 3 6 0 0 0 0 0 0 3 11 0 2013-06-02 0 1

Note:

There are 13 individuals with an income exceeding 100,000, all of whom have an education level of 'Graduation' or above. The highest reported income among them is 666,666.

Now let's re-plot:

In [20]:
plt.figure(figsize=(5, 3.5))

sns.histplot(data=df[~(df['Income'] > 100000)], x='Income')
plt.axvline(df['Income'].mean(), color='red', linestyle='--', label=f"Mean of {df['Income'].mean():.0f}")
plt.title('Income histogram when removing outliers')
plt.legend();

Note:

After removing outliers, the distribution of Income is approximately normal, centered around 52k.

2.2.3. Create Day_Enrolled¶

In [21]:
threshold = df['Year_Enrolled'].max()

df['Day_Enrolled'] = (threshold - df['Year_Enrolled']).dt.days

2.2.4. Create Age¶

In [22]:
threshold = df['Year_Enrolled'].dt.year.max()

df['Age'] = threshold - df['Year_Birth']
In [23]:
plt.figure(figsize=(10, 3.5))

plt.subplot(1, 2, 1)
sns.histplot(data=df, x='Age')
plt.title('Age histogram');
plt.subplot(1, 2, 2)
sns.histplot(data=df, x='Day_Enrolled')
plt.title('Day_Enrolled histogram')
plt.ylabel('');
  • The Age histogram reveals a relatively normal distribution, except for some outliers at the high end, particularly individuals aged over 100 years. Further investigation is required to understand and address these outliers.

  • In contrast, the Day_Enrolled histogram depicts a uniform distribution, indicating minimal utility for the clustering model. This feature may not significantly contribute to the segmentation process.

2.2.5. Check Age's outliers¶

In [24]:
df[df['Age'] > 80]
Out[24]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response Year_Enrolled Accepted_Campaign Dependents Day_Enrolled Age
192 7829 1900 Master Single 36640.00 1 0 26-09-2013 99 15 6 8 7 4 25 1 2 1 2 5 0 0 0 0 0 1 3 11 0 2013-09-26 0 1 276 114
239 11004 1893 Master Single 60182.00 0 1 17-05-2014 23 8 0 5 7 0 2 1 1 0 2 4 0 0 0 0 0 0 3 11 0 2014-05-17 0 1 43 121
339 1150 1899 PhD Relationship 83532.00 0 0 26-09-2013 36 755 144 562 104 64 224 1 4 6 4 1 0 0 1 0 0 0 3 11 0 2013-09-26 1 0 276 115

Note:

Only 3 customers are older than 80, and more interestingly, all of them are above 110 years old, the oldest being 121. I will remove them from the data.

In [25]:
plt.figure(figsize=(5, 3.5))
sns.histplot(data=df[~(df['Age'] > 80)], x='Age')
plt.axvline(df['Age'].mean(), color='red', linestyle='--', label= f"Mean of {df['Age'].mean():.0f}")
plt.title('Age histogram when removing outliers')
plt.legend();

Note:

The Age distribution exhibits a normal pattern after the removal of outliers.

2.2.6. Create Total_Cost¶

In [26]:
df['Total_Cost'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']

plt.figure(figsize=(5, 3.5))
sns.histplot(data=df, x='Total_Cost', bins=range(0,2501,100))
plt.title('Total_Cost histogram');

2.3. Drop redundant variables¶

In [27]:
corr = df.corr(numeric_only=True)
mask = np.triu(np.ones(corr.shape), k=1).astype(np.bool_)
corr = corr.where(mask)
corr = corr.unstack().sort_values(ascending=False).to_frame(name='Correlation')
corr = corr[corr['Correlation'] != 1].reset_index()
corr = corr.dropna()
corr.head(20)
Out[27]:
level_0 level_1 Correlation
0 Total_Cost MntWines 0.89
1 Total_Cost MntMeatProducts 0.84
2 Total_Cost NumCatalogPurchases 0.78
3 NumCatalogPurchases MntMeatProducts 0.72
4 Accepted_Campaign AcceptedCmp5 0.72
5 Dependents Teenhome 0.70
6 Dependents Kidhome 0.69
7 Accepted_Campaign AcceptedCmp1 0.68
8 Total_Cost NumStorePurchases 0.67
9 Total_Cost Income 0.66
10 Total_Cost MntFishProducts 0.64
11 NumStorePurchases MntWines 0.64
12 NumCatalogPurchases MntWines 0.64
13 Accepted_Campaign AcceptedCmp4 0.61
14 Total_Cost MntFruits 0.61
15 Total_Cost MntSweetProducts 0.60
16 MntFishProducts MntFruits 0.59
17 NumCatalogPurchases Income 0.59
18 MntSweetProducts MntFishProducts 0.58
19 MntMeatProducts Income 0.58

Note:

It's evident that the newly-created variables are inherently correlated with the variables from which they are derived. Therefore, I've decided to retain only the total variables in order to mitigate dimensionality and simplify the model's complexity.
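A more automated alternative (a sketch, not what this notebook does) is to scan the upper triangle of the correlation matrix and drop one column from each pair whose absolute correlation exceeds a threshold. The dataframe and the 0.9 cutoff below are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'a': a,
    'b': a + 0.05 * rng.normal(size=500),  # nearly a duplicate of 'a'
    'c': rng.normal(size=500),             # independent
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['b']
```

Manual selection, as done here, has the advantage that domain-meaningful totals are kept rather than whichever column happens to come second.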

In [28]:
cleaned_df = df.copy().drop(
    columns=
    [
        'Year_Birth', 
        'ID', 
        'Z_CostContact', 'Z_Revenue', 
        'Dt_Customer', 
        'Year_Enrolled',
        'Complain',
        'Recency',
        'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', # Total Cost
        'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', # Accepted_Campaign
        'Kidhome', 'Teenhome' # Total Dependents 
    ]
)

print(cleaned_df.shape)
cleaned_df.head()
(2240, 14)
Out[28]:
Education Marital_Status Income NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Response Accepted_Campaign Dependents Day_Enrolled Age Total_Cost
0 Graduation Single 58138.00 3 8 10 4 7 1 0 0 663 57 1617
1 Graduation Single 46344.00 2 1 1 2 5 0 0 2 113 60 27
2 Graduation Relationship 71613.00 1 8 2 10 4 0 0 0 312 49 776
3 Graduation Relationship 26646.00 2 2 0 4 6 0 0 1 139 30 53
4 PhD Relationship 58293.00 5 5 3 6 5 0 0 1 161 33 422

2.4. Correlation Analysis¶

In [29]:
corr = cleaned_df.corr(numeric_only=True)

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, cmap=sns.diverging_palette(145, 300, s=60, as_cmap=True), 
            linecolor='w', linewidth=0.1, fmt='.2f', mask=np.triu(corr), annot_kws={'size':9});

Some highlights:

  1. Income and Total_Cost: The correlation between Income and Total_Cost is notably strong (0.66). This suggests that customers with higher incomes tend to have higher total costs.

  2. NumWebVisitsMonth and Total_Purchases: There is a relatively strong negative correlation between NumWebVisitsMonth and Total_Purchases (-0.43). This implies that as the number of web visits per month decreases, the total number of purchases tends to increase.

  3. Accepted_Campaign and Response: The variables Accepted_Campaign and Response have a positive correlation of 0.43. This suggests that customers who accepted earlier campaigns are more likely to respond to the most recent one.

  4. Total_Cost and Total_Purchases: Total_Cost and Total_Purchases are highly positively correlated (0.82). This indicates that higher total costs correspond to a higher number of total purchases.

  5. Age and Total_Purchases: Age and Total_Purchases have a positive correlation of 0.16. This indicates that, in general, older customers tend to have more total purchases.

  6. Dependents and NumDealsPurchases: Dependents and NumDealsPurchases have a relatively strong positive correlation (0.44). This suggests that customers with more dependents tend to make more deal purchases.

3. Preprocessing¶

Before diving into our model, we need to perform some data transformations, such as encoding the categorical data and scaling the numerical data.

3.1. Encoding¶

In [30]:
education_dummies = pd.get_dummies(cleaned_df['Education'], prefix='Education', dtype=int)
marital_dummies = pd.get_dummies(cleaned_df['Marital_Status'], prefix='Marital_Status', dtype=int)
education_dummies.shape, marital_dummies.shape
Out[30]:
((2240, 4), (2240, 2))

3.2. Scale the Data¶

  • I will scale only the numerical data and then concatenate it back with the encoded education_dummies and marital_dummies
In [31]:
numerical_names = corr.columns.tolist()
numerical_data = cleaned_df[numerical_names].copy()
numerical_data.shape
Out[31]:
(2240, 12)
In [32]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numerical_data)
scaled_data = pd.DataFrame(scaled_data, columns=numerical_names)
In [33]:
scaled_data = pd.concat([education_dummies, marital_dummies, scaled_data], axis=1)
scaled_data.head()
Out[33]:
Education_Basic Education_Graduation Education_Master Education_PhD Marital_Status_Relationship Marital_Status_Single Income NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Response Accepted_Campaign Dependents Day_Enrolled Age Total_Cost
0 0 1 0 0 0 1 0.24 0.35 1.41 2.51 -0.55 0.69 2.39 -0.44 -1.26 1.53 0.99 1.68
1 0 1 0 0 0 1 -0.24 -0.17 -1.11 -0.57 -1.17 -0.13 -0.42 -0.44 1.40 -1.19 1.24 -0.96
2 0 1 0 0 1 0 0.77 -0.69 1.41 -0.23 1.30 -0.54 -0.42 -0.44 -1.26 -0.21 0.32 0.28
3 0 1 0 0 1 0 -1.02 -0.17 -0.75 -0.91 -0.55 0.28 -0.42 -0.44 0.07 -1.06 -1.27 -0.92
4 0 0 0 1 1 0 0.24 1.38 0.33 0.12 0.06 -0.13 -0.42 -0.44 0.07 -0.95 -1.02 -0.31
In [34]:
scaled_data.shape
Out[34]:
(2240, 18)

4. Dimensionality Reduction¶

Dimensionality reduction is a crucial technique in data analysis and machine learning. It serves various purposes, including:

  1. Curse of Dimensionality: As the number of features increases, the volume of the feature space grows exponentially. This can lead to storage, computational, and analysis challenges. High dimensions often result in sparse data, making it harder to extract meaningful insights.

  2. Improved Model Performance: Some machine learning algorithms struggle with high-dimensional data, leading to overfitting. Dimensionality reduction can alleviate this issue by reducing noise and improving model generalization.

  3. Visualization: Visualizing data in high dimensions is challenging. Dimensionality reduction techniques project data into lower-dimensional spaces (e.g., 2D or 3D), aiding visualization and understanding.

  4. Noise Reduction: High-dimensional data often contains noise or irrelevant features. Dimensionality reduction can filter out noise and retain essential information.
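Point 1 can be made concrete: in high dimensions, pairwise distances between random points concentrate around a common value, which weakens distance-based methods like K-means. A small illustration on synthetic uniform data (dimensions 2 and 500 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n=100):
    """Relative spread (std/mean) of pairwise distances between n uniform random points."""
    pts = rng.uniform(size=(n, dim))
    diffs = pts[:, None, :] - pts[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    d = d[np.triu_indices(n, k=1)]  # each pair once, excluding self-distances
    return d.std() / d.mean()

low_dim, high_dim = distance_spread(2), distance_spread(500)
print(f"{low_dim:.3f} vs {high_dim:.3f}")  # spread shrinks as dimension grows
```

When the relative spread is small, "near" and "far" become almost indistinguishable, which is one reason clustering benefits from a prior reduction step.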

In our data (scaled_data), we currently have 18 features, which represent an 18-dimensional space. Managing and interpreting data in such a high-dimensional space can be complex and resource-intensive. To address this, Principal Component Analysis (PCA) will be applied to reduce dimensionality while retaining the most critical information.

4.1. Initialize PCA object¶

Let's start with n_components = 3. As in my other projects, this arbitrarily chosen value serves as a baseline before the number of components is tuned systematically.

In [35]:
pca = PCA(n_components=3)
pca.fit(scaled_data)
Out[35]:
PCA(n_components=3)
In [36]:
pca_data = pca.transform(scaled_data)
pca_data
Out[36]:
array([[ 2.56736055,  1.96590984,  1.60191808],
       [-1.93787147, -0.89604036, -1.03074577],
       [ 1.6058172 , -0.37468525, -0.93492673],
       ...,
       [ 1.60791258, -1.19434595,  0.44511462],
       [ 1.571549  , -0.34181938, -1.86821953],
       [-1.09282977,  1.7029187 ,  0.97572601]])

Let's plot the result on the scree plot:

In [37]:
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var) + 1)]

plt.figure(figsize=(6, 4))
sns.barplot(x=labels, y=per_var, palette='magma', hue=labels, legend=False)
plt.ylabel('Percentage of Explained Variance (%)')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.tight_layout()
sns.despine()

for i, v in enumerate(per_var):
    plt.text(i, v + 1, str(v) + '%', color='black', ha='center', va='bottom')

Note: With n_components = 3, the variance explained by each of the first three principal components is as follows:

  • PC1: Explains approximately 31.8% of the total variance.
  • PC2: Explains approximately 13.7% of the total variance.
  • PC3: Explains approximately 10.7% of the total variance.

These percentages indicate the proportion of the original variance retained in each principal component. Since we're using the first three principal components, the PCA has captured a total of around 56.2% of the total variance in the data.

This reduction in dimensionality streamlines our data representation while preserving a substantial portion of its variability.

In [38]:
pca_df = pd.DataFrame(pca_data, columns=labels)
pca_df

pc1_values = pca_df['PC1']
pc2_values = pca_df['PC2']
pc3_values = pca_df['PC3']

fig = px.scatter_3d(pca_df, x=pc1_values, y=pc2_values, z=pc3_values, title='3D Scatter Plot of PC1, PC2, and PC3')
fig.update_layout(scene=dict(xaxis_title='PC1 - {0}%'.format(per_var[0]),
                             yaxis_title='PC2 - {0}%'.format(per_var[1]),
                             zaxis_title='PC3 - {0}%'.format(per_var[2])))
fig.show();

4.2. Identify the Optimal Number of Components¶

When deciding the number of principal components to retain, we will encounter a trade-off between reducing dimensionality and preserving information. This trade-off is influenced by the threshold we set for the variance explained.

  • Higher Threshold (e.g., 95% Variance Explained): Choosing a higher threshold, like capturing 95% of the variance, emphasizes retaining as much information as possible. This means keeping more principal components: a higher-dimensional representation that preserves most of the original variability.

  • Lower Threshold (e.g., 50% Variance Explained): Opting for a lower threshold, such as 50% variance explained, prioritizes dimensionality reduction. In this case, some information is sacrificed to achieve a substantial reduction in dimensionality. Although this approach results in a lower-dimensional data representation, it might not capture all subtle variations in the original data.

  • Trade-off: If preserving the majority of the original variability is crucial and computational resources allow, a higher threshold may be appropriate. On the other hand, if reducing dimensionality is a priority and a modest loss of information is acceptable, a lower threshold can be chosen.
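As a practical aside (not used in this notebook's workflow), scikit-learn can apply the threshold strategy directly: passing a float between 0 and 1 as `n_components` makes `PCA` keep the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 18))
X[:, 1] += 0.9 * X[:, 0]   # correlated columns so variance concentrates

# Keep as many components as needed to explain >= 95% of the variance
pca_95 = PCA(n_components=0.95).fit(X)
# Keep as many as needed for >= 50%
pca_50 = PCA(n_components=0.50).fit(X)

print(pca_95.n_components_, pca_50.n_components_)
```

The custom function below does something similar but reports a *range* of candidate component counts between two thresholds rather than a single value.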

To determine the optimal number of components, we will apply two criteria:¶

Total Variance Criterion:¶
  • Explanation: This criterion focuses on retaining a sufficient number of principal components to explain a specified percentage of the total variance in the data. The idea is to capture the majority of the variability while reducing dimensionality.
  • I will retain a number of components that collectively explain 50-70% of the total variance, aiming as close to 70% as possible to limit information loss.
Kaiser Criterion:¶
  • Explanation: The Kaiser criterion is based on the eigenvalues of the principal components. It suggests retaining components with eigenvalues greater than 1, as these components explain more variance than would be expected by chance.
  • I will also apply the Kaiser criterion when choosing the number of components.

4.2.1. Criterion #1 - Total Variance Criterion: 50-70% Cumulative Explained Variance¶

In [39]:
def find_components_near_threshold(data, from_threshold, to_threshold=None):
    """
    Find the number of principal components that explain the specified variance range/value.

    Parameters:
        data (array-like): The scaled data.
        from_threshold (float): The lower threshold for variance explained.
        to_threshold (float): The upper threshold for variance explained (optional).

    Returns:
        str: A string indicating the number of components that explain the specified variance range/value.
    """
    pca = PCA(n_components='mle', random_state=42)
    pca.fit(data)
    
    cumulative_variance = 0
    n_from = None
    n_to = None
    
    # Find the smallest n whose cumulative variance reaches from_threshold
    for n in range(1, len(pca.explained_variance_ratio_) + 1):
        cumulative_variance += pca.explained_variance_ratio_[n - 1]
        
        if cumulative_variance >= from_threshold:
            n_from = n
            break
            
    # If to_threshold is given, also find the n that reaches it
    if to_threshold is not None:
        for n in range(n_from + 1, len(pca.explained_variance_ratio_) + 1):
            cumulative_variance += pca.explained_variance_ratio_[n - 1]
            
            if cumulative_variance >= to_threshold:
                n_to = n
                break
    
    if n_from is None:
        return f"No components explain at least {from_threshold*100}% variance"
    elif n_to is None:
        return f"Number of components explaining at least {from_threshold*100}% variance: {n_from}"
    else:
        return f"Components explaining {from_threshold*100}% to {to_threshold*100}% variance: {', '.join(map(str, range(n_from, n_to)))}"
In [40]:
result1 = find_components_near_threshold(scaled_data, 0.5)
result2 = find_components_near_threshold(scaled_data, 0.5, 0.7)
print(result1)
print()
print(result2)
Number of components explaining at least 50.0% variance: 3

Components explaining 50.0% to 70.0% variance: 3, 4

I will create a table of explained variance using pca.explained_variance_ratio_ so we can inspect the values more easily:

In [41]:
n_features = len(scaled_data.columns)
pca = PCA(n_components=n_features)
pca.fit(scaled_data)

variance_df = pd.DataFrame({
    'Explained Variance': pca.explained_variance_ratio_.round(2),
    'Cumulative Explained Variance': np.cumsum(pca.explained_variance_ratio_).round(2),
}, index=[str(i) for i in range(1, n_features + 1)])

variance_df.index.name = 'n_components'
variance_df.T
Out[41]:
n_components 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Explained Variance 0.32 0.14 0.11 0.08 0.06 0.05 0.04 0.04 0.03 0.03 0.03 0.02 0.02 0.02 0.01 0.00 0.00 0.00
Cumulative Explained Variance 0.32 0.46 0.56 0.64 0.70 0.75 0.80 0.83 0.87 0.90 0.92 0.95 0.97 0.99 1.00 1.00 1.00 1.00

Let's plot:

In [42]:
plt.figure(figsize=(6, 4))
sns.lineplot(data=variance_df, x=variance_df.index, y='Cumulative Explained Variance', marker='o', label='Cumulative Explained Variance')
plt.axhline(y=0.5, color='c', linestyle='--', label='0.5 Threshold')
plt.axhline(y=0.7, color='m', linestyle='--', label='0.7 Threshold')

plt.title('Explained Variance vs. Number of PCA Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.legend()
plt.tight_layout();

Note:

Per the function output, n_components = 3 or 4 keeps the cumulative explained variance in the 50-70% range; the table shows that n = 5 lands exactly on the 70% boundary, so it is also a candidate.

Now let's move to the next criterion.

4.2.2. Criterion #2 - Kaiser Criterion: Eigenvalues > 1¶

In [43]:
variance_df['Eigenvalue'] = pca.explained_variance_
variance_df.T
Out[43]:
n_components 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Explained Variance 0.32 0.14 0.11 0.08 0.06 0.05 0.04 0.04 0.03 0.03 0.03 0.02 0.02 0.02 0.01 0.00 0.00 0.00
Cumulative Explained Variance 0.32 0.46 0.56 0.64 0.70 0.75 0.80 0.83 0.87 0.90 0.92 0.95 0.97 0.99 1.00 1.00 1.00 1.00
Eigenvalue 4.17 1.80 1.40 1.02 0.83 0.64 0.56 0.48 0.43 0.41 0.36 0.32 0.27 0.22 0.15 0.03 0.00 0.00
In [44]:
plt.figure(figsize=(8, 4))
sns.lineplot(data=variance_df, x=variance_df.index, y='Eigenvalue', marker='o')
plt.xticks(variance_df.index)
plt.axhline(y=1, color='r', linestyle='--', label='Kaiser Criterion: Eigenvalue > 1')
plt.title('Scree Plot with Kaiser Criterion')
plt.legend();
In [45]:
def find_components_eigenvalue_criterion(data):
    """
    Find the number of principal components based on the Kaiser Criterion (Eigenvalues > 1).

    Parameters:
        data (array-like): The scaled data.

    Returns:
        str: A string indicating the number of components based on the Kaiser Criterion.
    """
    cov_matrix = np.cov(data.astype(float), rowvar=False)
    eigenvalues = np.linalg.eigvalsh(cov_matrix)  # covariance matrix is symmetric, so eigvalsh gives real eigenvalues
    n_components = np.sum(eigenvalues > 1)

    return f"Number of components based on Kaiser Criterion (Eigenvalues > 1): {n_components}."
In [46]:
find_components_eigenvalue_criterion(scaled_data)
Out[46]:
'Number of components based on Kaiser Criterion (Eigenvalues > 1): 4.'

4.3. Conclusion about n_components¶

Criterion 1: Explained Variance (50-70%)

When using the criterion of explained variance, we aim to find the number of principal components that collectively capture a substantial portion of the variability in the data. In this case, we are interested in components that explain between 50% and 70% of the total variance in the dataset. Using the function find_components_near_threshold together with the variance table, we find that 3, 4, or 5 components fall in this range. These choices strike a balance between dimensionality reduction and information retention.

Criterion 2: Kaiser Criterion (Eigenvalues > 1)

The Kaiser Criterion involves examining the eigenvalues associated with the principal components. Eigenvalues indicate the amount of variance explained by each principal component. A common guideline is to retain only those components with eigenvalues greater than 1. When applying this criterion using the function find_components_eigenvalue_criterion, we identify that the first 4 components have eigenvalues exceeding 1. These components contribute significantly to the variability in the data.

Choosing n_components = 4 not only aligns with the Kaiser Criterion by retaining components with eigenvalues greater than 1, but also satisfies the Explained Variance Criterion by capturing a substantial portion (about 64%) of the variability in the data.

In [47]:
pca = PCA(n_components=4)
pca.fit(scaled_data)
pca_final = pd.DataFrame(
    pca.transform(scaled_data),
    columns=["PC1", "PC2", "PC3", "PC4"],
)
pca_final.head()
Out[47]:
PC1 PC2 PC3 PC4
0 2.57 1.97 1.60 0.08
1 -1.94 -0.90 -1.03 1.71
2 1.61 -0.37 -0.93 -0.88
3 -1.99 -1.12 0.24 -0.58
4 -0.25 0.30 -0.74 -0.41

Note: By reducing the dimensionality from 18 features to 4 principal components, we've effectively simplified the dataset while retaining the most critical information. This streamlined dataset is now more suitable for further analysis and modeling while avoiding the complexities associated with high-dimensional data.
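To see what each retained component actually represents, the loadings in `pca.components_` (shape `(4, 18)`) can be inspected: each row gives the weight of every original feature in that PC. A hedged sketch on synthetic stand-in data (the `feat_*` names are illustrative placeholders, not the notebook's columns):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
features = [f"feat_{i}" for i in range(18)]   # stand-ins for the 18 scaled columns
X = pd.DataFrame(rng.normal(size=(2240, 18)), columns=features)
X["feat_1"] += 0.8 * X["feat_0"]              # inject correlation

pca = PCA(n_components=4).fit(X)

# Rows = principal components, columns = original features
loadings = pd.DataFrame(pca.components_,
                        index=["PC1", "PC2", "PC3", "PC4"],
                        columns=features)
# Features driving PC1, ranked by absolute weight
print(loadings.loc["PC1"].abs().sort_values(ascending=False).head(5))
```

High-magnitude loadings identify which original variables (e.g., income or spending columns) dominate a component, which helps when naming the customer segments later.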

5. Customer Segmentation using K-Means Clustering¶

In the next step of the analysis, I'll segment customers with the K-Means clustering algorithm, using the Elbow Method and the Silhouette Score to determine the optimal number of clusters (k).

Elbow Method:

  • Visual approach to determine optimal cluster count
  • Run K-Means for different k values
  • Plot WCSS against the number of clusters
  • Identify "elbow point" where WCSS reduction slows
  • Indicates balance between minimizing variance and overfitting

Silhouette Score:

  • Measures object similarity to its cluster and others
  • Score range: -1 to 1
  • Higher score means object matches its cluster well
  • Helps evaluate cluster quality and select best k value
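The silhouette behavior described above can be checked on toy data (a sketch using synthetic blobs, separate from the project data): a correct cluster count on well-separated groups scores near 1, while a deliberately wrong count scores lower.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters
X, y_true = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.8,
                       random_state=42)

good = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
bad = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(X)

s_good = silhouette_score(X, good)
s_bad = silhouette_score(X, bad)
print(f"k=3 silhouette: {s_good:.2f}, k=8 silhouette: {s_bad:.2f}")
```

Over-clustering (k=8) splits tight blobs apart, so points sit close to a neighboring sub-cluster and the score drops; this is the behavior we exploit when comparing k values below.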

5.1. Initialize model¶

In [48]:
ranges = range(1, 11)
inertia = []
sil_scores = []
for n in ranges:
    kmeans = KMeans(n_clusters=n,
                    n_init=10,
                    algorithm='lloyd',
                    random_state=42)
    kmeans.fit(pca_final)
    inertia.append(kmeans.inertia_)

    if n > 1:
        sil_scores.append(silhouette_score(pca_final, kmeans.labels_))
    else:
        sil_scores.append(0)  # silhouette is undefined for a single cluster

5.2. Choose k¶

In [49]:
plt.figure(figsize=(9, 4))
plt.subplot(1, 2, 1)
plt.plot(ranges, inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k (Inertia)')

plt.subplot(1, 2, 2)
plt.plot(ranges, sil_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Optimal k')
plt.tight_layout();

Another way to plot:

In [50]:
kmeans = KMeans(n_init=10, random_state=42)

elbow_visualizer = KElbowVisualizer(kmeans, k=(2,10))
elbow_visualizer.fit(pca_final)  
elbow_visualizer.show()
Out[50]:
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
In [51]:
elbow_visualizer = KElbowVisualizer(kmeans, k=(2,10), metric='silhouette')
elbow_visualizer.fit(pca_final)  
elbow_visualizer.show();

5.3. Determining the Optimal Number of Clusters: Elbow vs. Silhouette¶

In our quest to determine the optimal number of clusters for customer segmentation, we employed two distinct methods:

  • Elbow Method: This method revealed a clear inflection point, or "elbow," at 4 clusters.

  • Silhouette Score: The Silhouette Score, which measures the quality of clustering, was highest for 2 clusters but also exhibited a substantial peak at 4 clusters.

Harmonization: To make an informed decision, I looked for agreement between the Elbow Method and the Silhouette Score. The elbow at 4 clusters coincides with the second peak of the Silhouette Score at 4 clusters, which leads to the final choice of 4 clusters for customer segmentation. This choice balances the reduction in inertia against the quality of the clustering.

In [52]:
kmeans = KMeans(n_clusters=4,
                n_init=10,
                random_state=42)
kmeans.fit(pca_final)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
pca_final['Cluster'] = labels
df['Cluster'] = labels
In [53]:
fig = plt.figure(figsize=(10,7))
ax = plt.subplot(projection='3d')
ax.scatter(xs = pca_final["PC1"], 
           ys = pca_final["PC2"], 
           zs = pca_final["PC3"], 
           s=50, 
           c= pca_final["Cluster"], 
           marker='o', 
           alpha = 0.5, 
           cmap = 'magma')
plt.title("Static 3D Clusters");
In [54]:
fig = px.scatter_3d(pca_final, x='PC1', y='PC2', z='PC3', color='Cluster', opacity=0.1, color_continuous_scale='magma')
fig.update_layout(title='Interactive 3D Clusters with Centroids')

fig.add_scatter3d(x=centroids[:, 0], y=centroids[:, 1], z=centroids[:, 2],
                  mode='markers', 
                  marker=dict(size=7, color='red', symbol='x'), 
                  showlegend=False)

6. Evaluate the Clusters¶

In [55]:
df.head(2)
Out[55]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response Year_Enrolled Accepted_Campaign Dependents Day_Enrolled Age Total_Cost Cluster
0 5524 1957 Graduation Single 58138.00 0 0 04-09-2012 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 3 11 1 2012-09-04 0 0 663 57 1617 2
1 2174 1954 Graduation Single 46344.00 1 1 08-03-2014 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 3 11 0 2014-03-08 0 2 113 60 27 1
In [56]:
sns.countplot(x=df['Cluster'], palette='magma', hue=df['Cluster'], legend=False);
In [59]:
df['Cluster'].value_counts()
Out[59]:
Cluster
1    988
3    596
0    440
2    216
Name: count, dtype: int64
In [57]:
pd.pivot_table(data=df, index='Cluster', values = corr.columns.to_list(), margins=True).T
Out[57]:
Cluster 0 1 2 3 All
Accepted_Campaign 0.22 0.08 1.61 0.24 0.30
Age 48.45 42.52 44.46 47.49 45.19
Day_Enrolled 488.57 308.32 407.59 309.37 353.58
Dependents 1.37 1.22 0.19 0.47 0.95
Income 52398.10 34484.70 77157.74 72518.36 52237.98
NumCatalogPurchases 2.61 0.52 6.11 5.00 2.66
NumDealsPurchases 4.91 1.86 1.23 1.58 2.33
NumStorePurchases 6.83 3.19 7.70 8.65 5.79
NumWebPurchases 6.45 2.02 5.75 5.16 4.08
NumWebVisitsMonth 6.82 6.37 3.63 3.07 5.32
Response 0.17 0.07 0.87 0.00 0.15
Total_Cost 653.01 91.22 1521.67 1092.04 605.80

Education and Marital_Status between Clusters¶

In [58]:
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
sns.countplot(data=df, x= 'Education', hue='Cluster', palette='magma');
plt.subplot(1,2,2)
sns.countplot(data=df, x= 'Marital_Status', hue='Cluster', palette='magma')
plt.tight_layout();

Note:

Education and Marital_Status appear to vary both within and among the customer segments, which means that these variables alone may not be the primary drivers of segment differentiation. In the context of the clustering analysis performed, it appears that other factors such as age, income, purchase behavior, and response to marketing campaigns have played a more significant role in distinguishing the segments.
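The within-versus-between-cluster comparison for categorical variables can be quantified rather than eyeballed from countplots: a row-normalized cross-tabulation shows each cluster's composition directly, and near-identical rows confirm the variable does not differentiate the segments. A sketch on illustrative synthetic data (the column names mirror the notebook's; the values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "Education": rng.choice(["Basic", "Graduation", "Master", "PhD"], size=n),
    "Cluster": rng.integers(0, 4, size=n),
})

# Row-normalize: share of each education level within each cluster
composition = pd.crosstab(demo["Cluster"], demo["Education"], normalize="index")
print(composition.round(2))
```

In the real data, running `pd.crosstab(df["Cluster"], df["Education"], normalize="index")` would make the "varies both within and among segments" observation concrete.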

7. Final Conclusion¶

7.1. Customer Segmentation¶

Segment 0: The Loyal Veterans¶

  • Segment Size: This segment comprises 440 customers, representing the most long-standing customers of the company.
  • Demographics: The customers in this segment are relatively older, with an average age of around 48 years. They have a moderate income, approximately $52k per year, and typically have 1 to 2 dependents.
  • Website Visits: They exhibit frequent visits to the company's website, indicating active online engagement.
  • Purchase Behavior: These customers are active purchasers both in-store and on the website. They show a preference for discounted items.
  • Spending Habits: The segment has a moderate total spending (Total_Cost) of around $650 in the last 2 years.
  • Campaign Response: Despite their loyalty, they demonstrate a low response rate to marketing campaigns (0.17), suggesting that traditional campaign strategies may not be highly effective for this group.

Segment 1: The Budget-Conscious Shoppers¶

  • Segment Size: This is the largest segment with approximately 988 customers.
  • Demographics: The customers in this segment are relatively younger, with an average age of 42. They have a lower income, around $34k, and usually have around 1 dependent.
  • Purchase Behavior: Although they frequently visit the website, their purchasing activity is moderate, especially for catalog, deal, and store purchases.
  • Website Visits: They show a high frequency of website visits but make fewer purchases compared to other segments, possibly due to budget constraints.
  • Spending Habits: This segment has a low total spending (Total_Cost) of $91.2 in the last 2 years.
  • Campaign Response: They exhibit a low response rate to marketing campaigns (0.07).

Segment 2: The Affluent Aficionados¶

  • Segment Size: This segment consists of 216 customers, representing the smallest among the segments.
  • Demographics: Customers in this segment are around the overall average age (about 44) with a remarkably high income, averaging $77k, and often have no dependents.
  • Purchase Behavior: They make many purchases through catalog, store, and web channels, but rarely buy discounted items (the lowest number of deal purchases of any segment).
  • Website Visits: Despite infrequent website visits (around 3.6 visits per month), they are prolific purchasers.
  • Spending Habits: This segment exhibits a high total spending (Total_Cost) of $1521 in the last 2 years.
  • Campaign Response: They show a very high response rate to marketing campaigns (0.87), with an average of 1.61 accepted campaigns per customer, the highest of any segment.

Segment 3: The Silent High Rollers¶

  • Segment Size: This segment is composed of 596 customers.
  • Demographics: Customers in this segment are relatively older, with high incomes and typically 0 or 1 dependents.
  • Purchase Behavior: They are significant purchasers, making numerous purchases across store, catalog, and web channels.
  • Website Visits: Despite infrequent website visits (around 3 visits per month), they are active buyers.
  • Spending Habits: This segment exhibits a notable total spending (Total_Cost) of $1092, indicating substantial spending on various products in the last 2 years.
  • Campaign Response: Surprisingly, they never respond to marketing campaigns (0.00).

7.2. Recommendations¶

Segment 0: The Loyal Veterans¶

Recommendations:

  1. Loyalty Programs: Enhance and promote loyalty programs tailored to the preferences of this segment to further strengthen customer retention.
  2. Personalized Campaigns: Implement personalized marketing campaigns based on their historical preferences to improve campaign response rates.
  3. Engagement Initiatives: Explore initiatives to increase engagement on the website, potentially offering exclusive online promotions.

Segment 1: The Budget-Conscious Shoppers¶

Recommendations:

  1. Budget-Friendly Promotions: Design marketing campaigns focused on budget-friendly promotions and discounts to resonate with the financial constraints of this segment.
  2. Educational Content: Provide educational content on how to maximize savings and value, fostering a sense of financial empowerment.
  3. Customer Feedback: Collect and analyze customer feedback to understand specific pain points and areas for improvement in the purchasing experience.

Segment 2: The Affluent Aficionados¶

Recommendations:

  1. Exclusive Events: Organize exclusive events or product launches for this segment to capitalize on their high income and purchasing power.
  2. Premium Experiences: Offer premium and personalized shopping experiences, recognizing their preference for non-discounted items.
  3. Targeted Campaigns: Create targeted marketing campaigns that align with their unique preferences, emphasizing quality and exclusivity.

Segment 3: The Silent High Rollers¶

Recommendations:

  1. Reactivation Campaigns: Implement reactivation campaigns to re-engage this segment, as they have shown potential with high incomes and significant past spending.
  2. Customer Surveys: Conduct surveys to understand the reasons behind their lack of response to marketing campaigns, helping tailor future strategies.
  3. Strategic Partnerships: Explore partnerships with high-end brands or exclusive offerings to capture their attention and drive engagement.

General Recommendation:¶

  1. Cross-Segment Analytics: Conduct comprehensive analytics across segments to identify overarching trends and opportunities that may benefit the entire customer base. This can inform global marketing strategies and improve overall customer satisfaction.